    Boundary Crossing Probabilities for General Exponential Families

    We consider parametric exponential families of dimension $K$ on the real line. We study a variant of \textit{boundary crossing probabilities} coming from the multi-armed bandit literature, in the case when the real-valued distributions form an exponential family of dimension $K$. Formally, our result is a concentration inequality that bounds the probability that $\mathcal{B}^\psi(\hat\theta_n, \theta^\star) \geq f(t/n)/n$, where $\theta^\star$ is the parameter of an unknown target distribution, $\hat\theta_n$ is the empirical parameter estimate built from $n$ observations, $\psi$ is the log-partition function of the exponential family and $\mathcal{B}^\psi$ is the corresponding Bregman divergence. From the perspective of stochastic multi-armed bandits, we pay special attention to the case when the boundary function $f$ is logarithmic, as it enables the analysis of the regret of the state-of-the-art KL-UCB and KL-UCB+ strategies, whose analysis was left open in such generality. Indeed, previous results only hold for the case when $K=1$, while we provide results for arbitrary finite dimension $K$, thus considerably extending the existing results. Perhaps surprisingly, we highlight that the proof techniques to achieve these strong results already existed three decades ago in the work of T.L. Lai, and were apparently forgotten in the bandit community. We provide a modern rewriting of these beautiful techniques that we believe are useful beyond the application to stochastic multi-armed bandits.
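
    Purely as an illustration of the kind of quantity being bounded (not the paper's proof technique), the sketch below runs a Monte Carlo estimate for the one-dimensional Bernoulli family, where the Bregman divergence of the log-partition function reduces to the binary KL divergence (up to the parametrization convention); the boundary choice $f(x) = \log(x)$, the sample sizes, and all function names are assumptions.

```python
import numpy as np

# Minimal Monte Carlo sketch (not the paper's technique): estimate the
# probability that the divergence between the empirical and true parameters
# crosses a logarithmic boundary, for the Bernoulli family where the Bregman
# divergence of the log-partition function reduces to the binary KL.

def kl_bernoulli(p_hat, p_star, eps=1e-12):
    """Binary KL divergence kl(p_hat, p_star); plays the role of B^psi for
    Bernoulli distributions (up to the parametrization convention)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    p_star = np.clip(p_star, eps, 1 - eps)
    return p_hat * np.log(p_hat / p_star) + (1 - p_hat) * np.log((1 - p_hat) / (1 - p_star))

def crossing_probability(p_star=0.3, n=50, t=1000, n_sims=20000, seed=None):
    """Estimate P( kl(p_hat_n, p_star) >= f(t/n) / n ) with the illustrative
    logarithmic boundary f(x) = log(x), as in KL-UCB-type analyses."""
    rng = np.random.default_rng(seed)
    samples = rng.binomial(1, p_star, size=(n_sims, n))
    p_hat = samples.mean(axis=1)
    threshold = np.log(t / n) / n
    return np.mean(kl_bernoulli(p_hat, p_star) >= threshold)

if __name__ == "__main__":
    # the crossing probability is small and shrinks as t/n grows
    print(crossing_probability())
```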

    Concentration inequalities for sampling without replacement

    Concentration inequalities quantify the deviation of a random variable from a fixed value. In spite of numerous applications, such as opinion surveys or ecological counting procedures, few concentration results are known for the setting of sampling without replacement from a finite population. Until now, the best general concentration inequality has been a Hoeffding inequality due to Serfling [Ann. Statist. 2 (1974) 39-48]. In this paper, we first improve on the fundamental result of Serfling [Ann. Statist. 2 (1974) 39-48], and further extend it to obtain a Bernstein concentration bound for sampling without replacement. We then derive an empirical version of our bound that does not require the variance to be known to the user.
    Comment: Published at http://dx.doi.org/10.3150/14-BEJ605 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
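
    A small simulation sketch, not from the paper: it compares the empirical frequency of a mean deviation when sampling without replacement to a two-sided Hoeffding-Serfling-style bound; the exact form of the bound, the population, and the function names are illustrative assumptions.

```python
import numpy as np

# Simulation sketch: empirical check of a Hoeffding/Serfling-style bound for
# the mean of n draws without replacement from a finite population in [0, 1].
# The bound 2 * exp(-2 n eps^2 / (1 - (n-1)/N)) follows Serfling's form;
# treat it as illustrative, not as the paper's improved or empirical bound.

def deviation_frequency(population, n, eps, n_sims=50000, seed=None):
    """Monte Carlo frequency of |sample_mean - population_mean| >= eps when
    sampling n items without replacement."""
    rng = np.random.default_rng(seed)
    mu = population.mean()
    count = 0
    for _ in range(n_sims):
        sample = rng.choice(population, size=n, replace=False)
        if abs(sample.mean() - mu) >= eps:
            count += 1
    return count / n_sims

def serfling_style_bound(N, n, eps):
    """Two-sided Hoeffding-Serfling bound for [0, 1]-valued populations."""
    return 2 * np.exp(-2 * n * eps**2 / (1 - (n - 1) / N))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    population = rng.uniform(0, 1, size=200)   # finite population, N = 200
    n, eps = 50, 0.1
    print(deviation_frequency(population, n, eps, seed=1),
          serfling_style_bound(200, n, eps))
```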

    Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds

    We consider change-point detection in a fully sequential setup, when observations are received one by one and one must raise an alarm as early as possible after any change. We assume that both the change points and the distributions before and after the change are unknown. We consider the class of piecewise-constant mean processes with sub-Gaussian noise, and we target a detection strategy that is uniformly good on this class (this constrains the false alarm rate and detection delay). We introduce a novel tuning of the GLR test that takes here a simple form involving scan statistics, based on a novel sharp concentration inequality using an extension of the Laplace method for scan statistics that holds doubly-uniformly in time. This also considerably simplifies the implementation of the test and its analysis. We provide (perhaps surprisingly) the first fully non-asymptotic analysis of the detection delay of this test that matches the known asymptotic orders, with fully explicit numerical constants. Then, we extend this analysis to allow some changes that are not detectable by any uniformly-good strategy (the numbers of observations before and after the change are too small for it to be detected by any such algorithm), and provide the first robust, finite-time analysis of the detection delay.
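
    As an illustration only (not the paper's tuning), the sketch below runs a Gaussian GLR scan test: at each time it maximizes the split statistic over candidate change points and raises an alarm when it exceeds a logarithmic threshold; the threshold constant, noise level, and function names are assumptions.

```python
import numpy as np

# Illustrative GLR-style scan test for a change in mean under Gaussian noise
# (sigma = 1).  The split statistic and the logarithmic threshold below are a
# standard Gaussian GLR form; the paper's exact tuning and constants differ.

def glr_scan_statistic(x):
    """Max over candidate split points s of the Gaussian GLR statistic
    (s (t-s) / t) * (mean_before - mean_after)^2 / 2 on the prefix x."""
    t = len(x)
    cumsum = np.cumsum(x)
    best = 0.0
    for s in range(1, t):
        mean_before = cumsum[s - 1] / s
        mean_after = (cumsum[-1] - cumsum[s - 1]) / (t - s)
        stat = (s * (t - s) / t) * (mean_before - mean_after) ** 2 / 2.0
        best = max(best, stat)
    return best

def sequential_detection(stream, delta=0.05):
    """Raise an alarm at the first time t where the scan statistic exceeds an
    illustrative threshold of order log(t / delta)."""
    for t in range(2, len(stream) + 1):
        if glr_scan_statistic(stream[:t]) > 2.0 * np.log(t / delta):
            return t
    return None

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(1.0, 1.0, 100)])
    # alarm time: expect a detection some tens of samples after the change at t = 100
    print(sequential_detection(stream))
```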

    Selecting Near-Optimal Approximate State Representations in Reinforcement Learning

    We consider a reinforcement learning setting introduced in (Maillard et al., NIPS 2011) where the learner does not have explicit access to the states of the underlying Markov decision process (MDP). Instead, she has access to several models that map histories of past interactions to states. Here we improve over known regret bounds in this setting and, more importantly, generalize to the case where the models given to the learner do not contain a true model resulting in an MDP representation, but only approximations of it. We also give improved error bounds for state aggregation.

    Linear Regression with Random Projections

    We investigate a method for regression that makes use of a randomly generated subspace $G_P$ (of finite dimension $P$) of a given large (possibly infinite) dimensional function space $F$, for example, $L_2([0,1]^d)$. $G_P$ is defined as the span of $P$ random features that are linear combinations of basis functions of $F$ weighted by random Gaussian i.i.d. coefficients. We show practical motivation for the use of this approach, detail the link that this random projection method shares with RKHS and Gaussian objects theory, prove, both in deterministic and random design, approximation error bounds when searching for the best regression function in $G_P$ rather than in $F$, and derive excess risk bounds for a specific regression algorithm (least squares regression in $G_P$). This paper stresses the motivation to study such methods; thus the analysis developed is kept simple for explanatory purposes and leaves room for future developments.
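
    A minimal sketch of this construction under stated assumptions: $P$ random features are built as Gaussian-weighted combinations of the first $D$ elements of a cosine basis of $L_2([0,1])$, and least squares is performed in their span; the basis, the dimensions, and the small ridge term added for numerical stability are illustrative choices, not the paper's exact setup.

```python
import numpy as np

# Sketch of least-squares regression in a random subspace G_P: P random
# features, each a Gaussian-weighted combination of the first D elements of a
# cosine basis of L2([0,1]).  Basis choice, D, and the tiny ridge term are
# illustrative assumptions.

def cosine_basis(x, D):
    """Evaluate the first D cosine basis functions at points x in [0, 1]."""
    k = np.arange(D)
    return np.cos(np.pi * np.outer(x, k))              # shape (len(x), D)

def random_features(x, weights):
    """Random features: basis functions mixed by i.i.d. Gaussian weights."""
    D = weights.shape[0]
    return cosine_basis(x, D) @ weights                 # shape (len(x), P)

def fit_least_squares(x, y, weights, reg=1e-8):
    """Least squares in the span of the P random features (small ridge term
    added only for numerical stability)."""
    Phi = random_features(x, weights)
    P = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + reg * np.eye(P), Phi.T @ y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, P, n = 50, 10, 200                               # basis size, subspace dim, samples
    weights = rng.normal(size=(D, P))                   # Gaussian mixing coefficients
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
    theta = fit_least_squares(x, y, weights)
    x_test = np.linspace(0, 1, 5)
    print(random_features(x_test, weights) @ theta)     # predictions of the G_P estimator
```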

    Risk-aware linear bandits with convex loss

    In decision-making problems such as the multi-armed bandit, an agent learns sequentially by optimizing a certain feedback. While the mean reward criterion has been extensively studied, other measures that reflect an aversion to adverse outcomes, such as mean-variance or conditional value-at-risk (CVaR), can be of interest for critical applications (healthcare, agriculture). Algorithms have been proposed for such risk-aware measures under bandit feedback without contextual information. In this work, we study contextual bandits where such risk measures can be elicited as linear functions of the contexts through the minimization of a convex loss. A typical example that fits within this framework is the expectile measure, which is obtained as the solution of an asymmetric least-squares problem. Using the method of mixtures for supermartingales, we derive confidence sequences for the estimation of such risk measures. We then propose an optimistic UCB algorithm to learn optimal risk-aware actions, with regret guarantees similar to those of generalized linear bandits. This approach requires solving a convex problem at each round of the algorithm, which we can relax by allowing only approximate solutions obtained by online gradient descent, at the cost of slightly higher regret. We conclude by evaluating the resulting algorithms on numerical experiments.
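
    To make the expectile example concrete, here is a small sketch (on a plain sample, without contexts) of the asymmetric least-squares characterization, solved by a standard fixed-point iteration; it does not reproduce the paper's confidence sequences or UCB algorithm, and the distribution and parameter names are assumptions.

```python
import numpy as np

# Sketch of the expectile as the minimizer of an asymmetric least-squares
# loss, solved by a simple fixed-point iteration on a sample.  This only
# illustrates the risk measure itself, not the contextual/linear estimation.

def asymmetric_loss(m, x, tau):
    """Asymmetric squared loss whose minimizer over m is the tau-expectile."""
    w = np.where(x >= m, tau, 1.0 - tau)
    return np.mean(w * (x - m) ** 2)

def expectile(x, tau=0.9, n_iter=100):
    """Fixed-point iteration: the tau-expectile m satisfies
    m = weighted mean of x with weight tau above m and (1 - tau) below."""
    m = x.mean()
    for _ in range(n_iter):
        w = np.where(x >= m, tau, 1.0 - tau)
        m = np.sum(w * x) / np.sum(w)
    return m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.exponential(1.0, size=10000)         # a skewed reward distribution
    e = expectile(x, tau=0.9)
    print(e, asymmetric_loss(e, x, 0.9))         # the 0.9-expectile exceeds the mean
```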

    Optimal Regret Bounds for Selecting the State Representation in Reinforcement Learning

    We consider an agent interacting with an environment in a single stream of actions, observations, and rewards, with no reset. This process is not assumed to be a Markov Decision Process (MDP). Rather, the agent has several representations (mapping histories of past interactions to a discrete state space) of the environment with unknown dynamics, only some of which result in an MDP. The goal is to minimize the average regret criterion against an agent who knows an MDP representation giving the highest optimal reward, and acts optimally in it. Recent regret bounds for this setting are of order $O(T^{2/3})$ with an additive term constant yet exponential in some characteristics of the optimal MDP. We propose an algorithm whose regret after $T$ time steps is $O(\sqrt{T})$, with all constants reasonably small. This is optimal in $T$, since $O(\sqrt{T})$ is the optimal regret in the setting of learning in a (single discrete) MDP.

    Adaptive Bandits: Towards the best history-dependent strategy

    Note: this document was accepted for publication at AI&Statistics 2011; I need to cite this work, so I refer to this version until the camera-ready version is published (next month).
    We consider multi-armed bandit games with possibly adaptive opponents. We introduce models $\Theta$ of constraints based on equivalence classes on the common history (information shared by the player and the opponent) which define two learning scenarios: (1) The opponent is constrained, i.e. he provides rewards that are stochastic functions of equivalence classes defined by some model $\theta^* \in \Theta$. The regret is measured with respect to (w.r.t.) the best history-dependent strategy. (2) The opponent is arbitrary and we measure the regret w.r.t. the best strategy among all mappings from classes to actions (i.e. the best history-class-based strategy) for the best model in $\Theta$. This makes it possible to model opponents (case 1) or strategies (case 2) that handle finite memory, periodicity, standard stochastic bandits and other situations. When $\Theta = \{\theta\}$, i.e. only one model is considered, we derive tractable algorithms achieving a tight regret (at time $T$) bounded by $\tilde{O}(\sqrt{TAC})$, where $C$ is the number of classes of $\theta$. Now, when many models are available, all known algorithms achieving a nice regret $O(\sqrt{T})$ are unfortunately not tractable and scale poorly with the number of models $|\Theta|$. Our contribution here is to provide tractable algorithms with regret bounded by $T^{2/3} C^{1/3} \log(|\Theta|)^{1/2}$.
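
    A minimal sketch of the single-model case under stated assumptions: a known model maps the shared history to one of $C$ equivalence classes, and the player runs an independent UCB1-style index per class; the specific model (classes given by the player's previous action), the constants, and all names are illustrative, not the paper's algorithm.

```python
import numpy as np

# Sketch of a history-class-based strategy for one known model theta: the
# model maps the shared history to one of C equivalence classes, and the
# player runs an independent UCB1-style index per class.  The example model
# (class = player's previous action) and the constants are illustrative.

class PerClassUCB:
    def __init__(self, n_classes, n_actions):
        self.counts = np.zeros((n_classes, n_actions))
        self.sums = np.zeros((n_classes, n_actions))

    def act(self, c):
        """UCB1-style index within equivalence class c."""
        n = self.counts[c]
        if (n == 0).any():
            return int(np.argmin(n))                    # play each arm once per class
        means = self.sums[c] / n
        bonus = np.sqrt(2.0 * np.log(n.sum()) / n)
        return int(np.argmax(means + bonus))

    def update(self, c, a, reward):
        self.counts[c, a] += 1
        self.sums[c, a] += reward

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, C, T = 3, 3, 5000
    # Constrained opponent: the reward of arm a depends only on the class,
    # here defined as the player's previous action (finite-memory opponent).
    reward_table = rng.uniform(0, 1, size=(C, A))
    player, prev_action, total = PerClassUCB(C, A), 0, 0.0
    for _ in range(T):
        c = prev_action                                 # class of the current history
        a = player.act(c)
        r = float(rng.random() < reward_table[c, a])    # Bernoulli reward
        player.update(c, a, r)
        prev_action, total = a, total + r
    print(total / T)                                    # average reward collected
```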